Introduction:

The goal of this project is to use R and Tableau dashboard to display the New York City flight dataset and acquire insights on how to reduce departure delay, one of the most addressed issues encountered by all stakeholders involved in the aviation supply chain. John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR), and LaGuardia Airport (LGA) in New York City are all incredibly crowded. PANYNJ got in touch with us to see whether we could comprehend the issue and resolve it. We think that departure delays are affected by the four most important elements - weather, flight patterns, carrier and manufacturer. In order to quantify, we select several departure delay indicators to display, such as average departure delay time (in minutes), departure delay occurrences, and departure delay percentage.

To better understand the connection between weather, manufacturing and flight patterns and carrier, our business analytics team has posed the three questions listed below.

Q: Is there any carrier or manufacturer that performs significantly better than the rest?

Carrier Analysis

Loading the libraries:

library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(corrplot)
library(xray)

Reading the files:

airlines <- read.csv("airlines.csv")
airports <- read.csv("airports.csv")
flights <- read.csv("flights.csv")
planes <- read.csv("planes.csv")
weather <- read.csv("weather.csv")

Setting up data frames to use later:

allflights <- flights %>%
  select(carrier,hour,origin,dep_delay,time_hour,month)

flights_planes <- left_join(flights,planes,by = "tailnum")

Which carrier has the most flights?

allflights %>%
  group_by(carrier) %>%
  summarise(count=n()) %>%
  ggplot(aes(y=count,x=reorder(carrier,-count))) +
  geom_bar(fill = "blue",color=" gray",stat="identity") + 
  ylab("Number of Flights") +
  xlab("Carriers") +
  ggtitle("Total Number of Flights by Carriers ") + 
  theme(plot.title = element_text(hjust = 0.5,size=15))

By looking at the graph, we can see that UA has the most number of flights.

Filtering the top ten airlines to get unbiased data:

If we take the airlines with less sample size i.e. less number of flights, the results would be indecisive. Therefore, we will have to consider the top ten airlines as they have the significant number of flights.

allflights %>%
  filter(carrier %in% c('UA','EV','B6','DL','AA','MQ','US','9E','WN','VX')) %>%
  ggplot( aes(y = dep_delay, x = reorder(carrier, dep_delay, na.rm=TRUE, FUN="median"))) +       geom_boxplot() + 
  ylab("Departure Delay")+ 
  xlab("Carriers") +
  ggtitle("Departure Delay by Carriers ") + 
  theme(plot.title = element_text(hjust = 0.5,size=15))

It is observed that DL, AA and MQ have large number of flights having delay greater than 700 min.

Average Departure Delay:

allflights %>%
  filter(carrier %in% c('UA','EV','B6','DL','AA','MQ','US','9E','WN','VX'))%>%
  filter(dep_delay>0) %>%
  group_by(carrier) %>%
  summarise(average_departure_delay= mean(dep_delay,na.rm=TRUE),count=n()) %>%
  ggplot( aes(y = average_departure_delay, x = reorder(carrier,-count))) +
  geom_point(size=2,color="blue") +
  xlab("Carriers") + 
  ylab("Departure Delay in min") + 
  ggtitle("Average Departure Delay(in min) by Top Ten Carriers") + 
  theme(plot.title = element_text(hjust = 0.5,size=15))

The graph shows that despite being the busiest carrier, UA has the shortest departure delay.

Manufacturer Analysis

Setting up the data frames for manufacturer analysis:

# All manufacturers
total_manu <- flights_planes %>%
  filter(!is.na(manufacturer)) %>% 
  group_by(manufacturer) %>%
  summarise(total_count=n(),average_dep_delay = mean(dep_delay, na.rm =TRUE)) %>%
  arrange(desc(total_count)) 

#Delayed Manufacturers
delay_manu <- flights_planes %>%
  filter(dep_delay>0) %>%
  filter(!is.na(manufacturer)) %>%
  group_by(manufacturer) %>%
  summarise(total_delay_count=n(),average_dep_delay = mean(dep_delay, na.rm =TRUE)) %>%
  arrange(desc(total_delay_count))

#Left Join between Total Manufacturers and Delayed Manufacturers
summary_manu <- left_join(total_manu,delay_manu, by="manufacturer") %>% 
  mutate(percent_delay=100*total_delay_count/total_count )
total_manu %>%
  ggplot(aes(x = total_count, y = reorder(manufacturer, total_count), fill=manufacturer )) +
  geom_bar(width=0.7, stat = "identity", fill="blue",color="gray") +
  theme_bw(base_line_size = 0, base_size = 7) +
  xlab("Number of Flights") +
  ylab("Manufacturers") + 
  ggtitle("Manufacturers by Total Number of Flights") + 
  theme(plot.title = element_text(hjust = 0.3,size=15))

As we can see from the graph above, only top five manufacturers have significant flights. If we consider all the manufacturers in our analysis, results would be inconclusive as the sample size for other manufacturers(other than top five) is remarkably less. Therefore, we will only consider top five manufacturers i.e. Boeing, Embraer, Airbus Industrie, Bombardier Inc and Mcdonnell Douglas Aircraft Co.

summary_manu %>%
  filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER        INC','CANADAIR')) %>%
  ggplot(aes(y = percent_delay, x = reorder(manufacturer,total_count), fill=manufacturer))+
  geom_bar(width=0.7, stat = "identity",fill="blue",colour="gray") +
  theme_bw(base_line_size = 0, base_size = 7) + 
  xlab("Manufacturers") + 
  ylab("Percentage Delay") + 
  ggtitle("Manufacturers by Percentage Delay") + 
  theme(plot.title = element_text(hjust = 0.5,size=15))

total_manu %>%
  filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC'   ,'CANADAIR'))%>%
  ggplot(aes(y = average_dep_delay,x = reorder(manufacturer,total_count),fill=manufacturer))+
  geom_bar(width=0.7, stat = "identity",fill="blue",colour="gray") +
  theme_bw(base_line_size = 0, base_size = 7) +
  xlab("Manufacturers") + 
  ylab("Average Departure Delay") + 
  ggtitle("Manufacturers by Average Departure Delay") + 
  theme(plot.title = element_text(hjust = 0.5,size=15))

From the above two graphs, we can see that Airbus Industrie has the least departure delay.

Q: Is there any pattern in which delay occurs?

Summarize the data to get to know more about it.

summary(airlines)
##    carrier              name          
##  Length:16          Length:16         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
summary(airports)
##      faa                name                lat             lon         
##  Length:1458        Length:1458        Min.   :19.72   Min.   :-176.65  
##  Class :character   Class :character   1st Qu.:34.26   1st Qu.:-119.19  
##  Mode  :character   Mode  :character   Median :40.09   Median : -94.66  
##                                        Mean   :41.65   Mean   :-103.39  
##                                        3rd Qu.:45.07   3rd Qu.: -82.52  
##                                        Max.   :72.27   Max.   : 174.11  
##       alt                tz              dst               tzone          
##  Min.   : -54.00   Min.   :-10.000   Length:1458        Length:1458       
##  1st Qu.:  70.25   1st Qu.: -8.000   Class :character   Class :character  
##  Median : 473.00   Median : -6.000   Mode  :character   Mode  :character  
##  Mean   :1001.42   Mean   : -6.519                                        
##  3rd Qu.:1062.50   3rd Qu.: -5.000                                        
##  Max.   :9078.00   Max.   :  8.000
summary(flights)
##        ID              year          month             day       
##  Min.   :     1   Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 81837   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median :163674   Median :2013   Median : 7.000   Median :16.00  
##  Mean   :163674   Mean   :2013   Mean   : 6.565   Mean   :15.74  
##  3rd Qu.:245510   3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :327346   Max.   :2013   Max.   :12.000   Max.   :31.00  
##     dep_time    sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   :   1   Min.   : 500   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 907   1st Qu.: 905   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1122  
##  Median :1400   Median :1355   Median :  -2.00   Median :1535   Median :1554  
##  Mean   :1349   Mean   :1340   Mean   :  12.56   Mean   :1502   Mean   :1533  
##  3rd Qu.:1744   3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1944  
##  Max.   :2400   Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:327346      Min.   :   1   Length:327346     
##  1st Qu.: -17.000   Class :character   1st Qu.: 544   Class :character  
##  Median :  -5.000   Mode  :character   Median :1467   Mode  :character  
##  Mean   :   6.895                      Mean   :1943                     
##  3rd Qu.:  14.000                      3rd Qu.:3412                     
##  Max.   :1272.000                      Max.   :8500                     
##     origin              dest              air_time        distance   
##  Length:327346      Length:327346      Min.   : 20.0   Min.   :  80  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 509  
##  Mode  :character   Mode  :character   Median :129.0   Median : 888  
##                                        Mean   :150.7   Mean   :1048  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##       hour           minute       time_hour        
##  Min.   : 5.00   Min.   : 0.00   Length:327346     
##  1st Qu.: 9.00   1st Qu.: 8.00   Class :character  
##  Median :13.00   Median :29.00   Mode  :character  
##  Mean   :13.14   Mean   :26.23                     
##  3rd Qu.:17.00   3rd Qu.:44.00                     
##  Max.   :23.00   Max.   :59.00
summary(planes)
##    tailnum               year          type           manufacturer      
##  Length:3322        Min.   :1956   Length:3322        Length:3322       
##  Class :character   1st Qu.:1997   Class :character   Class :character  
##  Mode  :character   Median :2001   Mode  :character   Mode  :character  
##                     Mean   :2000                                        
##                     3rd Qu.:2005                                        
##                     Max.   :2013                                        
##                     NA's   :70                                          
##     model              engines          seats           speed      
##  Length:3322        Min.   :1.000   Min.   :  2.0   Min.   : 90.0  
##  Class :character   1st Qu.:2.000   1st Qu.:140.0   1st Qu.:107.5  
##  Mode  :character   Median :2.000   Median :149.0   Median :162.0  
##                     Mean   :1.995   Mean   :154.3   Mean   :236.8  
##                     3rd Qu.:2.000   3rd Qu.:182.0   3rd Qu.:432.0  
##                     Max.   :4.000   Max.   :450.0   Max.   :432.0  
##                                                     NA's   :3299   
##     engine         
##  Length:3322       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
summary(weather)
##     origin               year          month             day       
##  Length:26115       Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.504   Mean   :15.68  
##                     3rd Qu.:2013   3rd Qu.: 9.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##       hour            temp             dewp           humid       
##  Min.   : 0.00   Min.   : 10.94   Min.   :-9.94   Min.   : 12.74  
##  1st Qu.: 6.00   1st Qu.: 39.92   1st Qu.:26.06   1st Qu.: 47.05  
##  Median :11.00   Median : 55.40   Median :42.08   Median : 61.79  
##  Mean   :11.49   Mean   : 55.26   Mean   :41.44   Mean   : 62.53  
##  3rd Qu.:17.00   3rd Qu.: 69.98   3rd Qu.:57.92   3rd Qu.: 78.79  
##  Max.   :23.00   Max.   :100.04   Max.   :78.08   Max.   :100.00  
##                  NA's   :1        NA's   :1       NA's   :1       
##     wind_dir       wind_speed         wind_gust         precip        
##  Min.   :  0.0   Min.   :   0.000   Min.   :16.11   Min.   :0.000000  
##  1st Qu.:120.0   1st Qu.:   6.905   1st Qu.:20.71   1st Qu.:0.000000  
##  Median :220.0   Median :  10.357   Median :24.17   Median :0.000000  
##  Mean   :199.8   Mean   :  10.518   Mean   :25.49   Mean   :0.004469  
##  3rd Qu.:290.0   3rd Qu.:  13.809   3rd Qu.:28.77   3rd Qu.:0.000000  
##  Max.   :360.0   Max.   :1048.361   Max.   :66.75   Max.   :1.210000  
##  NA's   :460     NA's   :4          NA's   :20778                     
##     pressure          visib         time_hour        
##  Min.   : 983.8   Min.   : 0.000   Length:26115      
##  1st Qu.:1012.9   1st Qu.:10.000   Class :character  
##  Median :1017.6   Median :10.000   Mode  :character  
##  Mean   :1017.9   Mean   : 9.255                     
##  3rd Qu.:1023.0   3rd Qu.:10.000                     
##  Max.   :1042.1   Max.   :10.000                     
##  NA's   :2729

To see how many flights are causing delay.

Clearing the data:

Remove the cancelled flights and flights that were early. Calculate the average departure delay

#remove all **NA** values from flights dataset.
not_cancelled <- flights %>% 
  filter(!is.na(arr_delay), !is.na(dep_delay))

not_cancelled
delayed_flights_summary <- not_cancelled %>% 
  group_by(tailnum) %>%
  summarize(
    count = n(),
    avg_dep_delay = mean(dep_delay)
  )

delayed_flights_summary

Plotting a graph between the number of flights and the average departure delay.

ggplot(delayed_flights_summary, aes(avg_dep_delay, count)) + 
  geom_point() +
  scale_y_continuous(breaks= seq(0,600,length.out=7))+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_vline(xintercept = mean(delayed_flights_summary$avg_dep_delay), color = "red")+
  labs(title = "Average Delay Vs No. of flights", x = "Avgerage Departure Delays(in minutes)", y = "Number of flights")

It is noticed that Most of the flights have delay below 50 minutes Max delay is 300 minutes. Mean of departure delay is around 13 minutes

Sorting the data into different parts of the day and removing the less useful data.

flight_data <- flights %>% drop_na(dep_delay)
flight_data <- flights %>% drop_na(sched_dep_time)

flight_data<-mutate(flights, timeOfDay = ifelse(sched_dep_time %in% 600:1159, "Morning",
                                                ifelse(sched_dep_time %in% 1200:1759, "Afternoon",
                                                       ifelse(sched_dep_time %in% 1800:2359, "Evening","Night"))))

To notice the delay variation throughout the day

Plotting a box-plot to get to know how the delay is in different parts of the day.

ggplot(data = flight_data, aes(x = dep_delay, y = c(timeOfDay))) +
  geom_boxplot(color = "blue")+
  scale_x_continuous(breaks= seq(0,1300,length.out=14))+
  labs(y = "Time of a Day", x = "Departure delay", title = "Distribution of Departure Delay during a day")+
  theme(plot.title = element_text(hjust = 0.5))

Evenings seem to have most Departure Delay In all the other times of the day, most of the flights have least delay.(Close to zero)

Splitting the data into different airports in New York to see if there is any pattern.

EWR_data<- flights %>% filter(flights$origin == "EWR"  &  flights$dep_delay>0)
LGA_data<- flights %>% filter(flights$origin == "LGA"  &  flights$dep_delay>0)
JFK_data<- flights %>% filter(flights$origin == "JFK"  &  flights$dep_delay>0)

EWR_data<-mutate(EWR_data, Month=month.name[EWR_data$month] )
LGA_data<-mutate(LGA_data, Month=month.name[LGA_data$month] )
JFK_data<-mutate(JFK_data, Month=month.name[JFK_data$month] )

flight_data<-mutate(flight_data, Month=month.name[flight_data$month] )

To notice the departure delay in each month.

A Box-plot per month.

For all flights:

ggplot (data = flight_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) + 
  geom_boxplot(fill = "orange") +
  ylim(0,900)+
  labs(x = "Month", y = "Departure Delays(in minutes)", title = "Delay distribution during each month")+
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 183142 rows containing non-finite values (stat_boxplot).

At EWR Airport:

ggplot (data = EWR_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
  geom_boxplot(fill = "orange") +
  ylim(0,900)+
  labs(x = "Month", y = "Departure Delays(in minutes)", title = "At EWR - Delay distribution during each month")+
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

At LGA Airport:

ggplot (data = LGA_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
  geom_boxplot(fill = "orange") +
  ylim(0,900)+
  labs(x = "Month", y = "Departure Delays(in minutes)", title = "At LGA - Delay distribution during each month")+
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

At JFK Airport:

ggplot (data = JFK_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
  geom_boxplot(fill = "orange") +
  ylim(0,900)+
  labs(x = "Month", y = "Departure Delays(in minutes)", title = "At JFK - Delay distribution during each month")+
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

June and July have the maximum delays in all the airports.

To check the pattern in which seasons are effecting the delay.

Dividing the data into seasons:

flight_data <- flight_data %>%
  mutate(season = ifelse(month %in% 9:11, "Fall",
                         ifelse(month %in% 6:8, "Summer",
                                ifelse(month %in% 3:5, "Spring",
                                       "Winter"))))

planes_with_delays <- flight_data %>% 
  filter(dep_delay>0) %>%
  select(carrier,hour, origin,month, season, dep_delay,time_hour,tailnum, distance,timeOfDay)

Plotting the seasonal effect on delay

ggplot(data=planes_with_delays, aes(x=season, y=dep_delay, fill= timeOfDay))+ geom_col()+
  labs(title = "Delay per season",x=" Season", y="Delay") +
  scale_fill_brewer(palette="Spectral")+
  theme(plot.title = element_text(hjust = 0.5))

Q: How weather affects departure delays ?

In this part, we try to analyze relationship between weather factors (humidity, visibility, precipitation, and wind speed) and departure delay (average delay and delay percent).

Firstly, we try to see if there is null value in weather dataset

anomalies(read_csv('weather.csv'))

df are created as a new dataframe by joining flights and weather dataset and selecting a number of parameters that related to this analysis .For some variables which are wind_dir, temp, humid, and wind_speed, there are little percent of null values (< 5%). Consequently, null values from these columns are exclude. Meanwhile, wind_gust has 79.56% of null values from total data which means that there are not enough data in this column, so we exclude it as well. We also create new column called ‘delay’ which indicates whether each flight is delay or not delay. If departure delay more than 0, we define it as 1 (delay flight). Otherwise, it is 0 (not delay flight). Inner-join is executed to flights and weather data by time hour and origin to create df.

flights <-read_csv('flights.csv') %>%
  select(carrier,hour, origin,month, dep_delay,time_hour) %>%
  mutate(delay = case_when(dep_delay>0 ~ 1,
                           dep_delay<=0 ~ 0))
weather <-read_csv('weather.csv') %>%
  filter(!is.na(temp),!is.na(humid),!is.na(wind_dir),!is.na(wind_speed))%>%
  select(temp,wind_speed,precip,time_hour,origin,
                              visib,dewp,humid,wind_dir,pressure)

df <- inner_join(flights,weather,by=c('time_hour','origin'))
anomalies(df)  

For pressure, null values are replaced by the mean of total pressure.

df$pressure[is.na(df$pressure)] <- mean(df$pressure,na.rm=TRUE)
anomalies(df)

After that, correlation matrix are plotted to see if there is any correlation between departure delay and weather factors. As you can see, dewpoint variable is highly correlated with temperature variable (89%). For this reason, dewp is excluded.

library(ggcorrplot)
delay_corr_plot <-
  df%>%
  filter(delay == 1) %>%
  select(dep_delay,temp,wind_speed,precip,visib,dewp,humid,wind_dir,pressure)%>%
  na.omit
delay_cor <- cor(delay_corr_plot) 
ggcorrplot(delay_cor, hc.order = TRUE, type = "lower",
           lab = TRUE, outline.col = "white",
           ggtheme = ggplot2::theme_gray,
           colors = c("#6D9EC1", "white", "#E46726"), lab_size=3, tl.cex = 10)

Multiple linear regression model is appply to weather data to inspect variable that is not significant when trying to predict departure delay by using a constraint that p-value of each parameter has to be less than 0.05. From the result, wind_dir is exclude as well.

fit <-glm(dep_delay ~., data =delay_corr_plot )
summary(fit)
## 
## Call:
## glm(formula = dep_delay ~ ., data = delay_corr_plot)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
##  -94.58   -30.64   -17.50    11.06  1268.60  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 434.402083  25.232017  17.216  < 2e-16 ***
## temp          0.722589   0.085760   8.426  < 2e-16 ***
## wind_speed    0.466188   0.029497  15.804  < 2e-16 ***
## precip       41.276196   4.367207   9.451  < 2e-16 ***
## visib        -0.367232   0.096624  -3.801 0.000144 ***
## dewp         -0.587777   0.091906  -6.395 1.61e-10 ***
## humid         0.589664   0.044960  13.115  < 2e-16 ***
## wind_dir     -0.003123   0.001716  -1.820 0.068709 .  
## pressure     -0.441851   0.024148 -18.298  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2840.981)
## 
##     Null deviance: 363813974  on 123924  degrees of freedom
## Residual deviance: 352043033  on 123916  degrees of freedom
## AIC: 1337135
## 
## Number of Fisher Scoring iterations: 2

Apart from above analysis, we assume that temperature does not affect delay directly so that there will be 5 factors taking into consideration which are humidity, visibility, wind speed, precipitation, and pressure.

Humidity

Monthly Trend of Humidity

First, monthly trend of the factor is plotted to analyze seasonal trend.

df%>%
   group_by( month)%>%
   summarise(average_humidity = mean(humid))%>%
   ggplot(aes(x=month, y=average_humidity))+
    geom_line()+
    scale_x_continuous(breaks = seq(1,12,by=1))+
    labs(title = "Monthly Trend of Humidity",
         y = "Relative Humidity (%)",
         x = "Month")+
    theme_bw()+
    theme(plot.title = element_text(hjust = 0.5))

There are high level of humidity in May to July and it drastically decreased in April and November. However, there is no apparent seasonal trend.

Delay and Humidity

To analyze the relationship between average departure delay and humidity, scatter plot is generated to observe the trend. X-axis demonstrates relative humidity which is cut into sections of 5. Average delay is based on y-axis while colour of each plot represents number of flights in that level of humidity.

humidity <- df %>%
  group_by(humid = cut(humid,breaks= seq(0,100, by =10))) %>%
  summarise(
    density = n(),
    average_delay = mean(dep_delay),
    delay_percent = mean(delay)*100)
humidity %>%
  ggplot(aes(x=humid, y= average_delay)) +geom_point(aes(color=density),size=5) +ylab("Average Delay (min)")+ggtitle("Average delay vs humid with density plot")+
  scale_color_gradient(low = 'blue',high = 'red')+xlab('Relative Humidity (%)')

From above graph, an increasing average delay trend can be seen as relative humidity is increasing. Most flights have level of humidity at 35 - 50 percent whereas the delay is approximately 10 minutes.

humidity %>%
  ggplot(aes(x=humid, y= delay_percent,fill=density)) +geom_col() +ylab("Delay percent")+ggtitle("Delay percent vs humidity")+
  scale_fill_gradient(low = 'blue',high = 'red')+xlab('Relative Humidity (%)')

For delay percent, there is an increasing trend compared to humidity when looking at the bar chart. y-axis represents delay percent while colour of the bar graphs display the number of flights for each relative humidity range.

Visibility

Monthly trend of visibility

Monthly trend of visibility is plotted

df%>%
   group_by( month)%>%
   summarise(average_visibility = mean(visib))%>%
   ggplot(aes(x=month, y=average_visibility))+
    geom_line()+
    scale_x_continuous(breaks = seq(1,12,by=1))+
    labs(title = "Monthly Trend of Visibility",
         y = "Visibility (miles)",
         x = "Month")+
    theme_bw()+
    theme(plot.title = element_text(hjust = 0.5))

There are high visibility for most time of the year except in January, Mar, and December. There is no apparent seasonal trend.

Delay and Visibility

Two graph is plot which are relationship between visibility and average delay, delay percent. In these two plots. scatter plot is used to analyzed the trend with trend line in order to make it easier to find overall trend. y-axis presents delay percent and average delay while x-axis presents visibility in mile.

visibility <- df %>%
  group_by(visib) %>%
  summarise(
    density = n(),
    average_delay = mean(dep_delay),
    delay_percent = mean(delay)*100)
visibility %>% ggplot(aes(x=visib, y=delay_percent)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Average delay vs visibility")+ylab("Average delay") +xlab('Visibility (mile)')

visibility %>% ggplot(aes(x=visib, y=average_delay)) +geom_point()+geom_smooth()+ggtitle("Delay Percent vs visibility")+ylab("Delay percent")+xlab('Visibility (mile)')

From the graphs, we can see the decreasing trends in average delay and delay percent compared to visibility which is make sense. If the visibility is low, there should be more delay since it is harder to control the plane. For delay percent, it drastically decrease in 0-2.5 visibility range. It might be the range that visiblity hugely affects the delay.

Wind Speed

Monthly trend of wind speed

Monthly trend of wind speed is plotted. Wind speed is highest in March and lowest in August.

df%>%
   group_by( month)%>%
   summarise(average_windspeed = mean(wind_speed))%>%
   ggplot(aes(x=month, y=average_windspeed))+
    geom_line()+
    scale_x_continuous(breaks = seq(1,12,by=1))+
    labs(title = "Monthly Trend of Wind Speed",
         y = "Average Speed (mph)",
         x = "Month")+
    theme_bw()+
    theme(plot.title = element_text(hjust = 0.5))

Delay and Wind speed

For wind speed, scatter plots are demonstrated which is the same as visibility graphs while y-axis displays the delays and x-axis demonstrates wind speed in mile per hour. However, when compared to visibility, opposite trend is given. Average delay and delay percent tend to increase as the wind speed increases from 0-25 mph. After 25 mph, there are fluctuation of delays which is hard to indicate what trend it is.

wind_speed <- df %>%
  group_by(wind_speed) %>%
  summarise(
    density = n(),
    average_delay = mean(dep_delay),
    delay_percent = mean(delay)*100)
wind_speed %>% 
  ggplot(aes(x=wind_speed, y=average_delay)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Average delay vs Wind Speed (mph)")+ylab("Average Delay (minute)")+xlab('Wind Speed')

wind_speed %>%
  ggplot(aes(x=wind_speed, y=delay_percent)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Delay Percent vs wind Speed (mph)")+ylab("Delay Percent") +xlab('Wind Speed')

Because of the fluctuation of delay, we create a histogram to investigate on number of flights and wind speed.

df %>% ggplot(aes(x=wind_speed)) +  geom_histogram(color="darkblue", fill="lightblue") + ggtitle("Number of flights vs Wind speed") + ylab("Number of flights")+
  xlab('Wind Speed')

It can be seen that there are not many flights experiencing wind speed more than 20 mile per hour which is the range of delay fluctuation. This means that we do not have enough information to analyze overall trend that wind speed is more than 20 so it fluctuates. For the visibility, most of the flights have high visilibity, consequently, there are fluctuation of data in low visibility range.

Precipitation

Monthly trend of precipitation

Monthy trend of precipitation is plotted.

df%>%
   group_by(month)%>%
   summarise(average_precipitation = mean(precip))%>%
   ggplot(aes(x=month, y=average_precipitation))+
    geom_line()+
    scale_x_continuous(breaks = seq(1,12,by=1))+
    labs(title = "Monthly Trend of Precipitation",
         y = "Precipitation (inch)",
         x = "Month")+
    theme_bw()+
    theme(plot.title = element_text(hjust = 0.5))

Delay and Precipitation

From the graph, precipitation is the highest at June which is reasonable since humidity is high in June as well. From the analysis in first part, there are many delays in June. For this reason, precipitation and humidity may affect delay of flights directly.

Focusing on the relationship with precipitation, there are an small increasing trend for average delay and delay percent from precipitation between 0 - 0.5 inch. Beyond that, there are fluctuation in data but the trendlines of both graphs still rise as an upward trend.

precipitation <- df %>%
  group_by(precip) %>%
  summarise(
    density = n(),
    average_delay = mean(dep_delay),
    delay_percent = mean(delay)*100)
precipitation %>% ggplot( aes(x = precip, y = average_delay)) +
  labs(x = "Precipitation (inch)", y = "Average Delay (min)") +
  stat_smooth() +
  geom_point(color = "gray40", alpha = .5) + ggtitle('Average Delay vs Precipitation')

precipitation %>% ggplot( aes(x = precip, y = delay_percent)) +
  labs(x = "Precipitation (inch)", y = "Delay Percent") +
  stat_smooth() +
  geom_point(color = "gray40", alpha = .5)+ggtitle('Delay Percent vs Precipitation')

Pressure

We create scatterplot between pressure and delay which has x-axis as pressure and y-axis as average delay and delay percent. The trends are display as same as the trends in percipitation analysis. It is overall upward trend with a fluctuation when pressure in more than 0.5 mbars.

pressure <- df %>%
  group_by(pressure) %>%
  summarise(
    density = n(),
    average_delay = mean(dep_delay),
    delay_percent = mean(delay)*100)
precipitation %>% ggplot( aes(x = precip, y = average_delay)) +
  labs(x = "Pressure (millibars)", y = "Average Delay (min)") +
  stat_smooth() +
  geom_point(color = "gray40", alpha = .5) + ggtitle('Average Delay vs Pressure')

precipitation %>% ggplot( aes(x = precip, y = delay_percent)) +
  labs(x = "Pressure(millibars)", y = "Delay Percent") +
  stat_smooth() +
  geom_point(color = "gray40", alpha = .5)+ ggtitle('Delay Percent vs Pressure')

Summary:

  1. Although having a significant number of flights, Carrier UA has the least departure delay and carriers DL, AA and MQ have the more departure delays. Carrier-specific delays happen because of Air Traffic Control issues, Passenger Issues, Crew Issues or maybe maintenance issues such as the mechanic have noticed some unexpected concern which could increase the departure delay time.

  2. Regarding the manufacturer analysis, Airbus Industrie has the least departure delay. This could be due to many factors such as specifications of engine, structure of the plane, etc.

  3. From the above graphs, it is noticed that departure delays in the given data of flight information in 2013 occur more during spring and summer, especially during the months of June and July (Yahoo news,2013). As we know spring and summer seasons are prone to thunderstorms which cause a lot of issues with visibility, wind speed and taking of flights. Also from the report given by the US Department of transportation, We know that 7.39% of the delay was caused by aviation system delay in June whereas it was 6.16% in July.10.43% of the delay was caused by the late arrival of aircraft in June, while it was 9.01% in July. Some of the delays were caused by maintenance and crew problems, 7.06% from June and 6.59% from July. Out of all the late flights,41.80% of June are caused by extreme weather or National aviation system delay and 37.91% of July (US Department of transportation, 2014).

  4. Also, In the time of the day graph, we notice that evenings have the most delay when compared to the rest of the day. Delay in the evenings is usually caused by a few factors. Which are Air traffic, Air traffic control, maintenance, baggage loading and also fuel refilling. In the evening, it is said that a lot of the delay is also caused due to the shift change of employees. And also, the Delay of one flight has led to delays in the next flights (Forbes,2019).

  5. From weather analysis, there are apparent effects to departure delay by weather factors. As for humidity, wind speed, precipitation, and pressure, they tend to make more departure delays if their value increase. In the other hand, high visibility tend to lessen flight delay. Especially for precipitation and humidity, as they have their highest value and in June, and flight delay are increasing during that month as well. So these two factors could affect departure delay directly.

  6. We are aware that delays brought on by bad or extreme weather are inevitable. However, the other elements, such as air traffic, maintenance, luggage loading, fuel replenishment, and delays brought on by staff members changing shifts, are reducible. Airlines can enhance the system to keep flights and luggage loading running smoothly, as well as properly changing personnel shifts so that the flights and airport are not disrupted. To pinpoint accurately which element is causing the delay, we need further information.

Contribution:

Tableau Dashboard:

We split the workings of dashboard

  • Siwach : Departure airport, Day, Month, Weather, Number of flights and delays

  • Suraj: Carrier graph and information about the graphs.

  • Harshini: Average delay of each destination

R markdown:

We split our EDA equally into three questions.

  • Siwach: Answer for the question ‘How weather affects departure delay?’

  • Suraj: Answer for the question ‘Is there any carrier or manufacturer that performs significantly better than the rest?’

  • Harshini: Answer for the question ‘Is there any pattern in which delay occurs?’

Reference:

1.Forbes, 2019. Most Flight Delays? In The Evening And Unrelated To Bad Weather [Online]. Available from: https://www.forbes.com/sites/garystoller/2019/06/03/most-flight-delays-in-the-evening-and-unrelated-to-bad-weather/ [Accessed on 06 December 2022].

2.US Department of transportation,2014. July Airline On-Time Performance Up, Cancellations Down From Previous Year, June [Online]. Available from: https://www.transportation.gov/briefing-room/july-airline-time-performance-cancellations-down-previous-year-june [Accessed on 06 December 2022].

3.Yahoo news,2013. Flight delays pile up amid FAA budget cuts [Online].Available from: https://news.yahoo.com/flight-delays-pile-amid-faa-budget-cuts-222819836--finance.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAI5gJw7TsvbMWk7GfF2UTv8tzkyZODYE0JiPPIDeu78B_R3IrEcQ_pYl9C77FVIP3t9ufAVm779rQKzwBcXjRGMryD3h7qTUzvxdGtDyNUBBRWewBpM_Q1ZDh2hCmlUZ1l18AywWXxqThFYKAvq9Zi_oF2l-GDVFr1rH4S1Kuuza [Accessed on 06 December 2022].